(57) Abstract

The present invention provides a multiprocessor computer
system and method for parallel processing of scalar
operations. The computer system includes a single program
counter for multiple arithmetic computational modules
(ACMs) which are coupled in parallel by a bus (56a-c).
Each ACM (48) includes an instruction memory (74) and
operand memories for providing operands to an ALU (72)
which performs scalar operations and which generates a
condition code. Additionally, each ACM (48) includes a
multiplexer for selecting, based on the condition code,
between the ALU (72) output, stored operands and constants
and, thus, providing a data-formatted output. Also within
each ACM is a switch network (64) for routing the
data-formatted output to the bus or back to the ALU for
subsequent scalar operations.

MULTIPROCESSOR COMPUTER SYSTEM AND METHOD FOR PARALLEL
PROCESSING OF SCALAR OPERATIONS

A portion of the disclosure of this patent 
document contains material which is subject to copyright
protection. The copyright owner has no objection to the 
facsimile reproduction by anyone of the patent document 
or the patent disclosure, as it appears in the Patent and 
Trademark Office patent file or records, but otherwise 
reserves all copyright rights whatsoever. 

A microfiche appendix is included in this 
application containing two microfiche. Microfiche number 
one (PHASE 1) contains 55 frames plus one test target 
frame, for a total of 56 frames. Microfiche number two 
(PHASE 2) contains 33 frames plus one test target frame 
for a total of 34 frames. 

FIELD OF THE INVENTION 

The present invention relates generally to computer 
systems and methods which use parallel processing
techniques and, more particularly, to multiprocessing
computer systems and methods designed for parallel 
processing of scalar operations within simulation 
applications. 

BACKGROUND OF THE INVENTION

Simulation is an invaluable tool for the evaluation
and development of all types of systems including
mechanical (e.g., robotic arm), chemical (e.g., chemical
reactions), electrical (e.g., application specific
integrated circuits), etc. The simulation of some
systems can be a computationally intensive task entailing
many scalar operations.

Attempts have been made to perform such simulations
with vector machines which are efficient when they can be
configured to perform an instruction on vectors (large
groups) of data. However, some systems do not lend
themselves to vectors of data; instead, they require many
scalar operations. In this type of application, vector
machines are not efficient and do not benefit from their
vector-oriented design.

An alternate architecture to the vector machines for
these applications is the single-bus multiprocessor
AD100. In its basic form, this system consists of five
devices attached to a single bus: an adder, a multiplier,
a data memory, a control processor, and a host interface.
In this architecture, all processors communicate via the
same bus, which increases the probability of bus
contention with consequent loss of speed. Additional
drawbacks include 1) the operations are not pipelined, 2)
there is no critical-path scheduling and 3) there is no
provision for expanding beyond a single AD100 system.

Another parallel-processing architecture to be
considered is MultiFlow, Inc.'s TRACE supercomputer. The
TRACE system includes seven functional units: two
integer, two floating point, two load/store (memory
access) and a branch-logic controller. In addition, the
TRACE architecture allows up to four such systems to be
interconnected for a total of 28 functional units.
However, because the TRACE system has a single memory for
each 7-unit system, it is capable of only two memory
references per clock cycle.

Additionally, the compiler for the TRACE system
operates by using a technique called "trace compacting".
This technique handles conditional jumps by guessing
which branch is most likely to be taken and generating
code based on the guess. Then, if a different branch is
actually taken at run-time, the compiler has prepared
"fix-up" code to correct the resulting error.

SUMMARY OF THE INVENTION 

A multiprocessor computer system having a bus is provided
for parallel processing of scalar operations. The computer
system further includes a single program counter and a
plurality of arithmetic computational modules (ACMs).
Each ACM includes an instruction memory responsive to the
single program counter, and operand memories for providing
operands to an ALU which performs scalar operations and
provides a condition code. Additionally, each ACM
includes a multiplexer for selecting between the ALU
output, stored operands and constants as a function of
the condition code and providing a data-formatted output.
A switch network, provided in each ACM, routes the
data-formatted output to the bus or back to the operand
memories and/or ALU for subsequent scalar operations.

Another aspect of the present invention is a method
for transforming multi-input (more than two) operations
into two-input operations such that the same result is
obtained and the critical path is minimized. The method
includes the steps of calculating the earliest finish
time (EFT) for each variable and creating the two-input
operations based on the lowest EFTs.

Another aspect of the present invention is a method
for scheduling the operations to be performed by an ACM.
The method comprises the steps of calculating the
down-stream time (DST) for each variable and, as operations
become available to be performed (due to the availability
of their operands), scheduling the operations; however,
if more than one operation is ready, then the one with the
greatest DST is chosen.

BRIEF DESCRIPTION OF THE DRAWINGS 

The invention is best understood from the following 
detailed description when read in connection with the 
accompanying drawings, in which: 

Fig. 1 is a dataflow diagram of a prior art analog
system;

Fig. 2A is a high-level functional block diagram of
Fig. 1 implemented with analog technology;

Fig. 2B is a high-level functional block diagram of
Fig. 2A implemented with digital technology;

Fig. 2C is a high-level functional block diagram of
Fig. 2B implemented with digital technology and designed
to time-multiplex components;

Fig. 2D is a high-level functional block diagram of
Fig. 2C implemented with digital technology and designed
to perform all functions by time-multiplexing;

Fig. 3 is a high-level functional block diagram of a
system incorporating the present invention which employs
a plurality of arithmetic computational modules (ACM's) in
parallel;

Fig. 4 is a high-level functional block diagram of
an individual processing element or arithmetic
computational module (ACM);

Fig. 5 is a detailed functional block diagram of
a portion of the ACM in Fig. 4;

Figs. 6A and 6B together show a detailed functional
block diagram of the ACM in Fig. 4;

Fig. 7A is a low-level dataflow diagram depicting
how a typical prior art compiler schedules a task;

Fig. 7B is a low-level dataflow diagram depicting
how a compiler incorporating an aspect of the present
invention schedules the same task as shown in Fig. 7A;
and

Fig. 8 is a flow chart depicting how a compiler
incorporating an aspect of the present invention
determines how to schedule tasks like the one in Fig. 7B.






DETAILED DESCRIPTION OF THE INVENTION 

Evolution of Digitally-Implemented Analog Computer

A multiprocessor computer system is shown which
benefits from the advantages derived from both analog and
digital computers. An analog computer is a fully
parallel device. For example, given the second order
polynomial Y=A+B*X+C*X**2 (or Y=A+X*(B+C*X)), the
dataflow diagram, seen in Fig. 1, shows that evaluating
this expression requires two multiplications and two
additions. Given that the equations to be solved require
two multiplications and two additions, an analog circuit,
as seen in Fig. 2A, employs two multipliers, 22 and 26, and
two adders, 20 and 24. While this approach offers the
ultimate in speed, it also becomes expensive for large
applications.

In contrast, a traditional digital computer employs 
a single arithmetic unit which is used repeatedly, as
many times as the equations (or program) may require. In
the example given above, the same arithmetic unit is used 
four times to perform the desired calculation. 
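As a rough illustration of the operation count above, the following Python sketch evaluates the polynomial in its factored form, Y=A+X*(B+C*X), one scalar operation at a time, the way a single arithmetic unit would be reused. The function name `horner_quadratic` and the operation counter are hypothetical and are not part of the disclosure.

```python
def horner_quadratic(a, b, c, x):
    """Evaluate Y = A + X*(B + C*X) and count the scalar operations used."""
    counts = {"mul": 0, "add": 0}

    t1 = c * x              # first multiplication
    counts["mul"] += 1
    t2 = b + t1             # first addition
    counts["add"] += 1
    t3 = x * t2             # second multiplication
    counts["mul"] += 1
    y = a + t3              # second addition
    counts["add"] += 1

    return y, counts


y, counts = horner_quadratic(a=1, b=2, c=3, x=2)
print(y, counts)
```

Each intermediate value depends on the previous one, so the four uses of the arithmetic unit occur strictly one after another.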

Direct replacement of analog components by their
digital counterparts yields a "fully-parallel" digital
system, with one component per mathematical operation, as
seen in Fig. 2B. The adders are 30 and 34, the
multipliers are 32 and 36, and the analog patch panel 28
from Fig. 2A is replaced with a digital switch network
and memory 38. Although this digital architecture is
capable of the fastest possible speed, it suffers from
the same cost disadvantages as its analog predecessor.

Furthermore, such a digitally implemented analog
architecture is time-wise inefficient because most of the
components are idle most of the time. As mentioned, a
fully parallel implementation requires two multipliers
and two adders, but there is no way all four components
can operate simultaneously.

The dataflow diagram in Fig. 1 reveals that the
first multiplication must finish before the first add can
begin, and this must finish before the second
multiplication can begin, and so on. In other words,
regardless of the hardware configuration, the time
required to evaluate this expression is at least two add
times plus two multiply times. This limitation is known
as "precedence constraints" or "data precedence". And it
is this concept of precedence constraints or critical
path which limits the amount of parallelism available in
an application.

Viewed from another perspective, it is the concept
of precedence constraints which also creates an
opportunity for economizing hardware by re-using
components. Referring back to the configuration in Fig.
2B, given the above-mentioned precedence constraints, the
same adder 30 and the same multiplier 32 can be used for
each of the two operations as seen in Fig. 2C. In fact,
given a component that could perform a host of operations
including an add and multiply, the entire equation could
be solved with one component, such as an ALU 39, as seen
in Fig. 2D.

The above evolution (illustrated in Figs. 2A through
2D) merges the attractive aspects of both the analog and
digital worlds.

Overall System Architecture

Fig. 3 shows a high-level block diagram of the
exemplary embodiment of a system which incorporates the
present invention. A system user accesses the system via
a workstation 40. A host computer 42 compiles source
code, written in CSSL, which is a widely-used language
for programming continuous system simulations. CSSL is a
simulation language standard of the Society for Computer
Simulation. Since CSSL is a parallel language for
describing parallel physical systems, it is well-suited
for programming a parallel machine like the present
invention.

The compiler, using the parallel CSSL code,
schedules the instructions to be run on the available
arithmetic computational modules (ACM's) 48 in the
system. The method of scheduling is itself an aspect of
the present invention which is discussed in detail below.
Then host 42 loads the information directly into the
setup interface 44. It should be noted that a system
incorporating the present invention is not a stand-alone
computer system. It can be characterized as a peripheral
system to a conventional host, in which the host is
needed for preparatory functions prior to running
simulations on the system.

Next, setup interface 44 via bus 56a-c loads
controller 46, ACM's 48 and I/O interfaces 50 with their
respective instructions. The actual running of the
instructions is controlled by a single program counter
which resides in the controller 46. The controller 46
contains a program counter, an address stack, an
instruction memory and register, and miscellaneous
control logic. Although branching is relatively rare in
a system incorporating the present invention, if a
programmed application requires that it happen, the
controller 46 by its design can accommodate this need.

Although Fig. 3 shows a single bus, designated 56a-c,
connecting the setup interface 44, the controller 46,
and the ACM's 48, it is actually three different busses.
The first bus 56a is for setup/diagnostic purposes which
connects the setup interface to the controller 46, to the
ACM's 48 and to the I/O interfaces 50. It is this bus
56a which is used to load the individual instruction
memories and data memories for the controller 46, the
ACM's 48 and the I/O interfaces 50. Besides loading
instructions, this bus 56a is also used for conducting
system diagnostics.

The second bus 56b is a data bus which is used for
data transfer among the ACM's 48 and the I/O interfaces
50. This bus 56b is much faster than the setup bus
because the speed of run time is more crucial than the
speed of setup/diagnostic time.

And, the third bus 56c is the program counter bus on
which the controller 46 broadcasts the program count to
each of the ACM's 48 and I/O interfaces 50 in the system.
Again, because there is only one program counter for all
of the active processors, ACM's 48 and I/O interfaces 50,
the system is not well-suited for handling branches.
This architectural decision was conscious and a result of
balancing the design's gained efficiency in the
parallel processing of scalar operations against the
drawbacks of not branching easily. Note that an obvious
branch that the system must be able to handle is the end
of program branch: at the end of a program the system
must be able to start back at the beginning.

In addition to the ACM's 48 seen in Fig. 3, the
system contains I/O interface modules 50. The I/O
interface modules 50 are also connected to each of the
three busses 56a, 56b and 56c for receiving instructions,
transferring data, and accessing the program count,
respectively. These modules are for digital and analog
interfaces and are controlled by the same program counter
that controls the ACM's 48. It is important that the
computational processors, ACM's 48, as well as the I/O
processors 50, for the present invention be controlled by
the same program counter because the applications
targeted by this system require that data computations
and data transfers be time deterministic. And, for the
compiler/scheduler to not only take maximum advantage of
the fine-grained parallelism but also resolve potential
bus conflicts prior to a program's execution, it must
know exactly when such computations and transfers will
occur.
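Because every computation and transfer is tied to a known program count, a bus conflict can in principle be detected before the program runs. The following Python sketch illustrates that idea under a simplifying assumption that is not stated in the text: that data bus 56b carries at most one transfer per cycle. The function name `find_bus_conflicts` and the schedule format are hypothetical.

```python
from collections import Counter


def find_bus_conflicts(transfers):
    """Return the cycles in which more than one module drives the data bus.

    transfers: iterable of (cycle, source_module) pairs taken from a
    precomputed schedule. Assumes one transfer per cycle on the bus.
    """
    per_cycle = Counter(cycle for cycle, _source in transfers)
    return sorted(cycle for cycle, n in per_cycle.items() if n > 1)


schedule = [(0, "ACM0"), (1, "ACM1"), (1, "IO0"), (2, "ACM0")]
print(find_bus_conflicts(schedule))
```

A scheduler could use such a check to move one of the conflicting transfers to a free cycle before any code is loaded.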

An example of how the I/O interface 50 may be used
is for "hardware-in-the-loop" simulations. These are
real-time simulations where portions of the simulation
are carried out using a simulator, such as the present
invention, and the remaining portions are carried out
with pieces of the actual device(s) being simulated.
The data being transferred between the simulator and the
actual device(s) must be speedy, accurate, and precisely
timed. Because the present invention compiles/schedules
instruction execution and data transfer prior to a
program run, it can easily provide the precise timing.

The remaining block of Fig. 3 is the run-time
interface 52. The function of the run-time interface 52
is to allow data to be accessed and subsequently
displayed on a scope or screen while simulations are
taking place.

ACM Architecture 

Fig. 4 shows a high-level functional block diagram
of an ACM 48 consisting of an input register 62 and an
output FIFO 76 both coupled to bus 56b for receiving and
sending data, respectively, external to ACM 48. Also
included in the ACM 48 are an indexed memory 66 and two
operand memories 68a and 68b which provide the data
inputs to ALU 72. In the exemplary embodiment of the
present invention, the indexed memory is a 128Kx32 RAM
and the operand memories 68a and 68b each include a pair
of 1Kx32 RAMs (see Fig. 6B). The pair of 1Kx32 RAMs acts
like a dedicated dual-port memory by allowing two values
to be fetched from two of the RAMs and two values to be
stored in the other two RAMs (a total of four memory
references) per clock cycle.

The data outputs from input register 62, indexed
memory 66 and ALU 72 are fed into a digital switch
network 64, consisting of a group of strategically placed
multiplexers, which provides flexible, internal dataflow
trafficking within the ACM 48. The switch network 64, in
turn, supplies data inputs to the output FIFO 76 (for
outputting to the bus 56b), the indexed memory 66 (for
indexing functions) and the operand memories 68a and 68b
(for subsequently supplying operands to ALU 72). As
indicated by the arrows, all of the above functional
blocks in the ACM 48 are controlled by a wide instruction
word represented by block 74 which is actually an
instruction memory and register.

Note that some of the functional blocks shown in 
Fig. 4 are labelled with their primary function and the
additional descriptive phrase "AND LOGIC". This is done 
to better correlate this level diagram with the lower- 
level diagram discussed below. 

Individual Functions: SWITCH, COMPARE, MAX/MIN

Fig. 5 shows additional details of the exemplary 
embodiment of an ACM 48. Fig. 5 shows two pairs of RAMs 
88a, 88b and 90a, 90b, respectively. In the exemplary 
embodiment of the present invention, each of the RAMs 
88a, 88b, 90a and 90b is a 1Kx32 RAM. The inputs to 
these RAMs, lines 102a and 104a, are provided by the
switch network (although not shown in Fig. 5, it is
element 64 in Fig. 4). The inputs to the respective pairs
of RAMs, 102a and 104a, and the outputs of the RAMs are
then fed to their respective muxes 92a and 92b. The outputs
of the muxes 92a and 92b are then fed into registers 80a 
and 80b, respectively. The outputs of registers 80a and 
80b are then fed to registers 82a and 82b as well as to 
ALU 86. The outputs of registers 82a and 82b and ALU 86 
are then fed to mux 84. 

It should be noted that the dotted lines indicate the
functional block, from Fig. 4, to which the additional
details belong. For instance, registers 80a and 80b
reside in the operand memory functional blocks 68a and
68b, respectively, while registers 82a, 82b, multiplexer
84 and ALU 86 reside in ALU functional block 72.

It should also be noted, as seen in Fig. 5, that mux
84 also has a "1" and a "0" input such that a "1" or a
"0" can be generated by any function which may need these
values as outputs. One example is a hardware comparator
function (COMPARE) with data output as opposed to
condition code output.

The COMPARE function is performed by loading the
values to be compared in registers 80a and 80b. ALU 86
is instructed, via the instruction register (shown in
Fig. 4 and Figs. 6A and 6B), to perform one of the
compare operators (e.g., <, >, =, etc.). The typical
output of this operation is a condition code from ALU 86,
via line 83, which, in the exemplary embodiment of the
present invention, is used as part of the mux select for
mux 84 instead of feeding branching logic as in
traditional architectures. The condition code, now part
of the mux select, then selects either the "1" or "0"
input to mux 84 depending on the result of the operation.
Thus, the result is a hardware comparator function which
produces a data formatted output rather than a condition
code formatted output. It should be noted that the hard
wired "1" and "0" are data formatted representations of
the logical variables. In the exemplary embodiment of
the present invention, the data format comprises a
thirty-two bit data line, in which the hard wired "1" is
represented by thirty-two "1"s and the hard wired "0" is
represented by thirty-two "0"s.
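The behaviour of the COMPARE function can be sketched in software as follows. This is only a behavioural model in Python, not the hardware itself; the names `compare`, `ALL_ONES`, `ALL_ZEROS` and `COMPARE_OPS` are hypothetical, and only the three operators named in the text are modelled.

```python
import operator

ALL_ONES = 0xFFFFFFFF    # hard wired "1": thirty-two 1 bits
ALL_ZEROS = 0x00000000   # hard wired "0": thirty-two 0 bits

# Compare operators the ALU may be instructed to perform.
COMPARE_OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}


def compare(op, a, b):
    """Model of COMPARE: the condition code drives the mux select."""
    condition_code = COMPARE_OPS[op](a, b)   # produced by ALU 86 on line 83
    # Mux 84 selects the hard wired "1" or "0" input instead of
    # the condition code feeding branch logic.
    return ALL_ONES if condition_code else ALL_ZEROS


print(hex(compare("<", 1, 2)), hex(compare(">", 1, 2)))
```

The result is an ordinary 32-bit data word, so it can be routed through the switch network and used as an operand by a later operation.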

Another function that ACM 48 performs in hardware is
a SWITCH function. This too is efficiently implemented
using the four registers 80a, 80b, 82a, and 82b and the
multiplexer 84 surrounding the ALU 86. The SWITCH
function has been designed to take the place of certain



WO 93/21577 



PCT/US93/03165 




conditional branches, for example, an IF-THEN-ELSE
statement. For a system incorporating the present
invention, this is most efficiently accomplished in
hardware. This added functionality is necessary, as
mentioned, because the present system is not well-suited
to handle branching and ALU 86 does not provide this type
of functionality. To explain how this hardware
implements a SWITCH, a simple IF-THEN-ELSE example is
used.

Given the statement

    IF X THEN
        Y = A,
    ELSE
        Y = Z

(which is represented in SWITCH form as
Y = SWITCH(X, A, Z)),

on the first cycle the values A and Z are loaded
into the first pair of registers, 80a and 80b. On the
next cycle, X and 0 are loaded into the first pair of
registers, 80a and 80b, and A and Z are shifted to the
second pair of registers, 82a and 82b. (It should be
noted that the variable X could have been X=W<0 which was
processed by a COMPARE function prior to being used by
the SWITCH.) At the same time, the ALU 86 is programmed
by the instruction register (not shown) to perform the
greater than (>) operation on its held inputs X and 0.
The result of this compare is supplied by the control
lines of the ALU 86 and, in turn, is used as a mux select
for determining which value, A or Z, should be output
from the mux 84. As mentioned, this configuration is
necessary because the ALU 86 does not provide this
functionality and the present system needs this
functionality to perform simulations effectively and
efficiently.
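The two-cycle register sequence described above can be modelled with a small Python sketch. It is a toy behavioural model under the assumption that a true X compares greater than zero; the class `SwitchDatapath` and the function `switch` are hypothetical names, and timing details beyond the two loads are omitted.

```python
class SwitchDatapath:
    """Toy model of registers 80a/80b, 82a/82b, ALU 86 and mux 84."""

    def __init__(self):
        self.r80a = self.r80b = 0   # first register pair
        self.r82a = self.r82b = 0   # second register pair

    def clock(self, in_a, in_b):
        # The first pair shifts into the second pair, then the first
        # pair latches the new inputs.
        self.r82a, self.r82b = self.r80a, self.r80b
        self.r80a, self.r80b = in_a, in_b

    def mux_output(self):
        # ALU 86 performs ">" on the first pair; the result is used
        # as the mux select for the values held in the second pair.
        condition_code = self.r80a > self.r80b
        return self.r82a if condition_code else self.r82b


def switch(x, a, z):
    """Y = SWITCH(X, A, Z): A if X is true, otherwise Z."""
    dp = SwitchDatapath()
    dp.clock(a, z)   # first cycle: A and Z into 80a and 80b
    dp.clock(x, 0)   # next cycle: X and 0 loaded, A and Z shifted on
    return dp.mux_output()


print(switch(1, 10, 20), switch(0, 10, 20))
```

No branch is taken at any point: both candidate values travel through the datapath and the condition code simply picks one of them.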

Similar to the COMPARE and SWITCH functions are the
MAX and MIN functions. The MAX function is described in
detail, from which an understanding of the MIN function is
easily derived. Operation of the MAX function (e.g.,
MAX(X,Y)) is performed by loading registers 80a and 80b
with the values for the variables X and Y. Next, ALU 86
is instructed to perform a greater than (">") function.
Next, the condition code generated by ALU 86, as was the
case with the COMPARE and SWITCH functions, being part of
the mux select, line 83, selects the greater of the two
values contained in registers 82a and 82b.
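A minimal behavioural sketch of MAX in Python follows, together with one plausible reading of MIN. The text describes only MAX; the MIN variant shown here, which reuses the same ">" compare with the mux selection inverted, is an assumption, and the names `acm_max` and `acm_min` are hypothetical.

```python
def acm_max(x, y):
    """MAX(X, Y): the ">" condition code selects the greater operand."""
    condition_code = x > y            # ALU 86 performs ">"
    return x if condition_code else y  # mux 84 passes the selected value


def acm_min(x, y):
    """MIN(X, Y): assumed to use the same compare with the selection inverted."""
    condition_code = x > y
    return y if condition_code else x


print(acm_max(3, 8), acm_min(3, 8))
```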

Thus, by adding logic to ALU 86, an ACM 48 can
perform not only the functions available in the ALU 86
but also the additional functions of COMPARE, SWITCH, MAX
and MIN which make the ACM 48 and, consequently, the
overall architecture faster, more efficient and better
adapted for handling limited conditional branching
functions.

Fig. 5 provides a detailed diagram of only a portion
of ACM 48; a complete detailed diagram of ACM 48 is
provided for the sake of completeness in Figs. 6A and 6B.

Beginning with Fig. 6A, the data input to ACM 48 via
bus 56b is received by input register 62. The output of
register 62 is fed into the switch network (element 64 in
Fig. 4) shown as a group of four three-input muxes: muxes
100, 102 and 104 are shown in Fig. 6A and mux 106 is
shown in Fig. 6B.

Also providing an input to the switch network is the
indexed memory (element 66 in Fig. 4) shown in Fig. 6A as
two parts. The first part, which handles the index
addressing, comprises a RAM 110 which receives its input
from the switch network. In the exemplary embodiment of
the present invention, RAM 110 is a 1Kx32 RAM. RAM 110
supplies one input of mux 112 where the other input is
supplied directly from the switch network. The output of
mux 112 is fed to register 114 which supplies one input
to adder 116. The other input to adder 116 is supplied
by the instruction register 113 which gets its input from
the instruction memory 115 as seen in Fig. 6B. In the
exemplary embodiment of the present invention, the
instruction memory is a 16Kx128 RAM.

Adder 116 provides an address which is fed to the 
second part of the indexed memory which handles the 
storing of indexed data. The address is fed to a RAM 
120. In the exemplary embodiment of the present 
invention, RAM 120 is a 128Kx32 RAM. RAM 120 supplies 
one input of mux 122 whose other input is supplied by the 
output of RAM 110. The output of mux 122 is registered 
by register 124 and fed into the switch network. 
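The address path through adder 116 can be sketched as follows. This Python model assumes that the instruction register contributes an offset that is added to the index value, and that addresses wrap at the size of RAM 120; neither detail is stated in the text, and the names `indexed_address` and `indexed_read` are hypothetical.

```python
INDEX_RAM_WORDS = 1024         # RAM 110: 1K x 32
DATA_RAM_WORDS = 128 * 1024    # RAM 120: 128K x 32


def indexed_address(index_value, instruction_offset):
    """Adder 116: index from register 114 plus a value from the
    instruction register. The modulo wrap is an assumption."""
    return (index_value + instruction_offset) % DATA_RAM_WORDS


def indexed_read(data_ram, index_value, instruction_offset):
    """Read the word of RAM 120 addressed by adder 116."""
    return data_ram[indexed_address(index_value, instruction_offset)]


data_ram = [0] * DATA_RAM_WORDS
data_ram[105] = 42
print(indexed_read(data_ram, index_value=5, instruction_offset=100))
```

This kind of computed address is what allows table lookups whose position depends on a run-time value.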

Fig. 6B shows many of the same details described in
Fig. 5, thus Fig. 6B is briefly described. In addition
to what has already been described in conjunction with
either Fig. 5 or Fig. 6A, Fig. 6B shows the output of mux
84 feeding register 130. The output of register 130 then
provides the final input to the switch network (this is
represented by the output of register 130 feeding the
third input to mux 106). Mux 106 is then fed to output
FIFO 76 which delivers data to bus 56b. The output of
mux 106 is also supplied to the LED register 132 which
provides input to the LEDs 134.

Fig. 6B also shows the control outputs of ALU 86
supplied to registers 136 and 138. Additionally, the
program count which is received from the single program
counter (not shown), via bus 56c, is fed into the program
count register 140. The output of register 140 is fed to
the instruction address register 142 which then provides
it to the instruction memory 115.

Reviewing Figs. 6A and 6B, it is clear that because
an ACM employs registers (e.g., 80a/80b, 82a/82b, 130,
etc.), the ACM is divided into stages or pipelines. And
because the functionality of an ACM is pipelined, in
general, a new instruction can begin with each new clock
cycle. Pipelining effectively allows the ACM to process
instructions in a parallel fashion which compounds the
parallel processing capabilities of the entire
architecture. In the exemplary embodiment of the present
invention, the pipelining causes a typical latency of
five cycles in the ACM.
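The effect of pipelining on throughput can be illustrated with a short Python calculation. It assumes an idealized case that the text does not spell out: the instructions are independent of one another and one is issued on every clock cycle. The function name `cycles_to_complete` is hypothetical.

```python
PIPELINE_LATENCY = 5   # typical ACM latency in cycles, per the text


def cycles_to_complete(num_instructions, latency=PIPELINE_LATENCY):
    """Cycles until the last of a run of independent instructions finishes,
    assuming one new instruction is issued per clock cycle."""
    if num_instructions == 0:
        return 0
    # The first result appears after `latency` cycles; each further
    # instruction adds one cycle.
    return latency + (num_instructions - 1)


for n in (1, 10, 100):
    print(n, cycles_to_complete(n))
```

Under these assumptions a long run of instructions approaches one result per cycle, even though each individual instruction still takes five cycles.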

Given this description of the hardware, how the
software (the compiler) takes advantage of the parallel
configuration of the programmable ACM's 48 is now

Compiler: Scheduling Function

To take full advantage of the parallel hardware
configuration, the compiler breaks down the source code
into a parallel equivalent that can be run on the
hardware. Because the above-described architecture is
designed for optimizing parallel scalar operations, in
the exemplary embodiment of the present invention, the
parallel equivalent is at the level of fine-grained
parallelism.

Fine-grained parallelism means that the parallelism
that exists at the scalar operator level is exploited.
This results in a large number of elementary tasks. In
the exemplary embodiment of the present invention, the
elementary tasks which are supported by the architecture
are one-input operators and two-input operators (an
exception to this is the three-input SWITCH function).
Thus, the code is broken down into machine level tasks
with one or two inputs.

However, because certain variables depend on other
variables in a given program or simulation, optimizing
via parallelism has its limitations. As explained with
reference to the hardware, the controlling limitation is
known as the critical path limit. The critical path
limit is based on the order in which certain variables
must be calculated. And, as previously mentioned, the
requirement that one variable be calculated before
another is known as "data precedence". It is this data
precedence which defines the critical path limit and
which cannot be improved upon by increasing parallel
resources (i.e., putting more ACM's in the system).

For example, if Y=A*B+C+D then a typical compiler
introduces a new variable such that X=A*B and Y=X+C+D.
At this point, typical compilers continue to scan from
left to right with the resulting equations looking like
X=A*B, Z=X+C and Y=Z+D. Thus, a typical compiler, as
seen in Fig. 7A, attempting to optimize the number of
operations, calculates Y by first multiplying A and B,
then adding that result to C and, finally, adding
that result to D. This series of three operations
results in the use of three sequential time slots. And
assuming for comparative purposes that each time slot
requires one system cycle, then the above example takes
three (3) cycles. The above described aspects of
compiling are well known in the art; however, the
compiler for the present invention takes this optimizing
a bit further.

The compiler aspect of the present invention
recognizes that different variables are available at
different times. The compiler of the present invention
leaves the three-input sum as a three-input sum until
the entire source file is scanned and analyzed. Then the
reduction is performed to minimize the critical path.
For instance, the input variables C and D are available
at the beginning because they are input variables. And,
because they are objects of an add operation, they can be
added during the same time that A and B are multiplied as
seen in Fig. 7B. Subsequently, the results of the
addition (C+D) and multiplication (A*B) are added to
produce Y. In this version of calculating Y, because the
addition and multiplication of the input variables occur
in parallel as seen in Fig. 7B, the computation only
takes two sequential time slots. This translates into
two system cycles, a speed-up of over 33%.
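The reduction of the three-input sum can be sketched as follows. This Python example is one possible reading of "creating the two-input operations based on the lowest EFTs": it repeatedly pairs the two operands that become available earliest. The function name `reduce_by_eft`, the unit latency, and the use of a heap are assumptions made for illustration and are not taken from the microfiche source code.

```python
import heapq
from itertools import count


def reduce_by_eft(operands, latency=1):
    """Break a multi-input sum into two-input adds, pairing the operands
    with the lowest earliest finish times (EFTs) first.

    operands: non-empty list of (eft, name) pairs.
    Returns (eft_of_result, list of (result, left, right) operations).
    """
    tie = count()   # tie-breaker so equal EFTs keep insertion order
    heap = [(eft, next(tie), name) for eft, name in operands]
    heapq.heapify(heap)
    ops = []
    while len(heap) > 1:
        eft_a, _, a = heapq.heappop(heap)
        eft_b, _, b = heapq.heappop(heap)
        result = f"({a}+{b})"
        # A two-input task finishes one latency after its later input.
        eft = max(eft_a, eft_b) + latency
        ops.append((result, a, b))
        heapq.heappush(heap, (eft, next(tie), result))
    return heap[0][0], ops


# Y = A*B + C + D: the product is ready after one time slot,
# while C and D are input variables available at time zero.
final_eft, ops = reduce_by_eft([(1, "A*B"), (0, "C"), (0, "D")])
print(final_eft, ops)
```

In this example C and D are paired first, so the sum C+D is formed in the same time slot as the product A*B, and Y is ready after two slots rather than the three needed by a left-to-right reduction.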

The way this is accomplished requires first that the
source program be reduced into individual machine
instructions. This is a straightforward operation
currently performed by many compilers for conventional
sequential machines. Each task consists of the
calculation of a single output variable. In a dataflow
diagram, such as seen in Fig. 1, each such variable may
be thought of as the output of a "component", analogous
to an analog component (Figs. 1 and 2A).

To obtain these two-input tasks from the multiple
input source code equations and then schedule them, the
algorithm works by computing two separate values for each
task. The first value is the earliest finish time (EFT)
and the second value is the down-stream time (DST).

A listing of the source code which performs these
functions is included as a microfiche appendix. The
microfiche appendix includes two copyrighted "C"
programs: PHASE1.c and PHASE2.c. In the exemplary
embodiment of the present invention, PHASE1.c performs
the preprocessing of the source code (i.e., textual
manipulation) and it also performs some of the EFT and
DST sorting and calculating; whereas, PHASE2.c performs
the scheduling of scalar operations based on the
preprocessing, EFT and DST information.

The EFT is the earliest time at which the task could 
be completed, given an unlimited number of processors and 
no bus contention. It is the length of the longest chain 
of dependent computations starting with state variables 
and constants, and leading to the completion of the task 
(or calculation of the output variable). The DST is the 
length of the longest path from the output of the 
component (the variable calculated by the task) to a 
"terminal" variable (one whose output is not needed as 
input for any further computations). 

Fig. 8 shows a high-level flowchart illustrating how 
the scheduling function of the compiler works: 

First, the EFT for each task is calculated, as indicated 
by block 200. Next, the DST for each variable is 
calculated, as indicated by block 204. Once the EFTs have 
been calculated, block 208, they are used to break down 
the multiple-input operators by grouping operands 
according to their EFTs, block 206. Based on the 
availability of variables, the compiler then schedules 
tasks, block 210. The DST list, block 212, is used to 
resolve conflicts during the scheduling of tasks which 
are "ready" (due to the availability of operands) to be 
scheduled at the same time. 

Calculating the EFT 

As mentioned, the EFT for each task is calculated 
first. All state variables (e.g., integrator outputs) and 
constants have an EFT of zero by definition because they 
are available at the beginning of the step. Because 
these are the only variables available at the beginning 
of the step, any algebraic variable (one whose value 
depends on the values of other variables) has an EFT 
determined by the EFTs of its input variables and the 
latency of the task generating it. For example, a 
multiplier task, represented in source form by Z=X*Y, has 
an EFT defined by 

EFT(Z) = MAX[EFT(X), EFT(Y)] + LATENCY 

where the LATENCY for a multiplier task on a system 
incorporating the present invention is five (5) cycles. 
This simply means that the multiplication can be 
finished, at the earliest, five (5) cycles after both 
inputs become available. 

After setting all EFTs for state variables and 
constants to zero, all other EFTs are set to -1, 
indicating an undefined value. The algorithm then sweeps 
repeatedly through the tasks, looking for a task whose 
inputs have their EFTs defined, and applying the formula 
including the appropriate latency for each operation. 
During the sweep, the algorithm also sorts the variables 
in order of increasing EFT, which guarantees that any 
variable that drives another variable, whether directly 
or indirectly, precedes it on the list. 
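
The repeated sweep described above can be sketched as follows. This is an illustrative sketch only and is not taken from PHASE1.c: the dictionary layout, the names `compute_efts` and `UNDEFINED`, and the assumption that an add also has a latency of five cycles (inferred from the worked examples in this description) are all hypothetical.

```python
UNDEFINED = -1

def compute_efts(tasks, sources, latency):
    """tasks: {output: (op, [inputs])}; sources: state variables and constants."""
    eft = {name: 0 for name in sources}            # available at the start of the step
    eft.update({out: UNDEFINED for out in tasks})  # everything else is undefined
    changed = True
    while changed:                                 # sweep repeatedly through the tasks
        changed = False
        for out, (op, inputs) in tasks.items():
            if eft[out] != UNDEFINED:
                continue
            if all(eft[i] != UNDEFINED for i in inputs):
                # EFT(Z) = MAX[EFT of the inputs] + LATENCY
                eft[out] = max(eft[i] for i in inputs) + latency[op]
                changed = True
    # With positive latencies, sorting by increasing EFT puts every
    # driving variable ahead of the variables it drives.
    order = sorted(eft, key=lambda name: eft[name])
    return eft, order

# Multiply latency of 5 is stated in the text; the add latency is an assumption.
LATENCY = {"*": 5, "+": 5}
tasks = {"Y": ("+", ["X", "S"]), "X": ("*", ["A", "B"]), "S": ("+", ["C", "D"])}
eft, order = compute_efts(tasks, ["A", "B", "C", "D"], LATENCY)
print(eft["X"], eft["S"], eft["Y"])  # 5 5 10
```

Task Y is deliberately listed first so that the first sweep must skip it. Its EFT is filled in on the second sweep, once X and S are defined.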

A first example of how the EFT is used to determine 
the most efficient breakdown follows. Suppose in the 
equation Y=A*B+C+D that A, B, C and D are all state 
variables. As mentioned, they would all have EFTs of 
zero. If the equation is broken down one level such that 
X=A*B and Y=X+C+D, then the EFT of the intermediate 
variable X is five, since the inputs have EFTs of zero and 
the operation has a latency of five cycles. 

The desired output Y is the sum of three inputs, 
with EFTs of 5, 0 and 0. The scheduler now applies the 
rule: group together the operands with the smallest 
EFTs. Thus, C and D are grouped together, rather than X 
and C as was the case in the earlier example for typical 
compilers. 

Now, the sum of C and D has an EFT of 5, and so does 
the product of A and B. Thus, the output Y has an EFT of 
10, as compared with an EFT of 15 using the reduction 
practices of typical compilers, which use a left-to-right 
scan of the source equation. 

A second example assumes the EFTs for A, B, C and D 
are 5, 10, 15 and 20, respectively. In this case, the 
EFT of X=A*B is 15 (MAX(5,10)+5) and Y would be the sum 
of three inputs with EFTs of 15, 15 and 20. The present 
compiler would group X and C to produce an intermediate 
sum with an EFT of 20. The desired output Y is then 
calculated and has an EFT of 25. 
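
Both examples can be reproduced with a short sketch of the grouping rule. It rests on one interpretation of that rule, namely that the two operands with the smallest EFTs are paired repeatedly until one sum remains. The function names and the add latency of five cycles are assumptions made for illustration.

```python
import heapq

def grouped_sum_eft(operand_efts, latency=5):
    """EFT of a multi-input sum when the two smallest-EFT operands are grouped first."""
    heap = list(operand_efts)
    heapq.heapify(heap)
    while len(heap) > 1:
        a = heapq.heappop(heap)            # smallest EFT
        b = heapq.heappop(heap)            # next smallest EFT
        heapq.heappush(heap, max(a, b) + latency)
    return heap[0]

def left_to_right_sum_eft(operand_efts, latency=5):
    """EFT of the same sum when operands are combined in source order."""
    eft = operand_efts[0]
    for nxt in operand_efts[1:]:
        eft = max(eft, nxt) + latency
    return eft

print(grouped_sum_eft([5, 0, 0]))        # 10  (first example: X, C, D)
print(left_to_right_sum_eft([5, 0, 0]))  # 15
print(grouped_sum_eft([15, 15, 20]))     # 25  (second example: X, C, D)
```

In the second example the left-to-right order happens to give 25 as well, which is consistent with the point that the best grouping depends on the EFTs of the operands rather than on the form of the equation.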

To recapitulate, the optimum method of reduction 
depends on the entire dataflow graph describing the 
generation of A, B, C, D and Y. The single equation 
Y=A*B+C+D cannot be optimized in isolation. The present 
invention takes a global approach by reducing the entire 
source program to a single basic block (i.e., a block 
without loops), which is possible because of the 
techniques described herein, such as macro expansion, 
loop unrolling, branch-free COMPARE and SWITCH 
operations, and others. 

Calculating the DST 

The next step is to calculate the DST for each task, 
as indicated by block 204. This is done by sweeping 
backwards over the task list, in order of decreasing EFT. 
If a variable X drives a variable Z directly (as with the 
previous example), then DST(X) ≥ DST(Z) + LATENCY, where 
LATENCY is the latency for the operation generating Z. 
Note that, unlike the similar formula for EFTs, "≥" is 
used rather than "=", since X may drive other variables 
with longer downstream paths. To perform this 
calculation, all DSTs are initially set to zero, and for 
each variable X that drives Z, the statement 

IF (DST(X) .LT. (DST(Z)+LATENCY)) THEN 
    DST(X) = DST(Z)+LATENCY 

is executed. When this step is complete, all DSTs are 
correctly calculated. Note that this step requires only 
a single pass through the task list, since the variables 
are processed in order of decreasing EFT. 
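
The backward sweep can be sketched in the same style. The sketch is illustrative only: the name `compute_dsts`, the data layout and the add latency are assumptions, and the EFT values are those of the first example above.

```python
def compute_dsts(tasks, eft, latency):
    """tasks: {output: (op, [inputs])}; eft: previously calculated EFTs."""
    dst = {name: 0 for name in eft}        # all DSTs are initially zero
    # Visit the outputs in order of decreasing EFT, so DST(Z) is final
    # before it is propagated to the variables that drive Z.
    for z in sorted(tasks, key=lambda name: eft[name], reverse=True):
        op, inputs = tasks[z]
        for x in inputs:
            if dst[x] < dst[z] + latency[op]:
                dst[x] = dst[z] + latency[op]
    return dst

LATENCY = {"*": 5, "+": 5}   # add latency is an assumption
tasks = {"X": ("*", ["A", "B"]), "S": ("+", ["C", "D"]), "Y": ("+", ["X", "S"])}
eft = {"A": 0, "B": 0, "C": 0, "D": 0, "X": 5, "S": 5, "Y": 10}
dst = compute_dsts(tasks, eft, LATENCY)
print(dst["Y"], dst["X"], dst["A"])  # 0 5 10
```

Y is a terminal variable and keeps a DST of zero, X and S are one operation away from it, and the state variables are two operations away.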

Scheduling 

Next, a list is prepared of all variables that are 
"available", that is, whose values have been calculated. 
At the beginning, only state variables and constants are 
available. As tasks are scheduled, additional variables 
become available, and are added to the "available" list. 

Any task whose input variables are all available is 
called a "ready" task. Initially, only those tasks are 
ready whose inputs are either constants or state 
variables. Again, as the scheduling proceeds, additional 
variables become available and, consequently, additional 
tasks become ready and are added to the "ready" list. 

On any cycle, tasks are started on every processor 
for which there is at least one ready task. Such an 
algorithm is called a "greedy" algorithm because it never 
leaves a processor idle if there is work for it to do. 
Of course, it may happen that no task is ready, because 
all pending tasks are waiting for inputs which are in the 
process of being calculated in the pipelines. In that 
case, the processor is forced to wait until the inputs 
are available. 

Given that each processor starts some task if there 
are tasks ready for it, sometimes a choice needs to be 
made between tasks if there is more than one ready. If 
these circumstances arise, the algorithm chooses the task 
with the largest DST, since heuristically, it is the one 
that is most likely to keep other tasks waiting later. 
If several tasks have equal DST, one is chosen 
arbitrarily (in the exemplary embodiment of the present 
invention, it is the one with the lowest identifying 
number). 
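
The greedy selection described in this section can be sketched as a simple list scheduler. This is a sketch under stated assumptions and is not the PHASE2.c program: it assumes that any processor can run any task, that each processor can start one new task per cycle (pipelined), that there is no bus contention, and that the task graph is acyclic with all inputs eventually produced. The name `schedule` and the use of alphabetical order as the "lowest identifying number" are hypothetical.

```python
def schedule(tasks, sources, latency, dst, num_processors):
    """Return {task output: start cycle} using greedy, DST-prioritized scheduling."""
    available_at = {name: 0 for name in sources}   # cycle at which a value is available
    start = {}
    pending = sorted(tasks)
    cycle = 0
    while pending:
        # A task is ready when all of its inputs have been calculated.
        ready = [t for t in pending
                 if all(available_at.get(i, float("inf")) <= cycle
                        for i in tasks[t][1])]
        # Largest DST first; ties go to the lowest identifier.
        ready.sort(key=lambda t: (-dst[t], t))
        for t in ready[:num_processors]:           # one new task per processor
            start[t] = cycle
            available_at[t] = cycle + latency[tasks[t][0]]
            pending.remove(t)
        cycle += 1
    return start

LATENCY = {"*": 5, "+": 5}   # add latency is an assumption
tasks = {"X": ("*", ["A", "B"]), "Y": ("+", ["X", "C"]), "W": ("+", ["C", "D"])}
dst = {"X": 5, "Y": 0, "W": 0}
print(schedule(tasks, ["A", "B", "C", "D"], LATENCY, dst, 1))
# {'X': 0, 'W': 1, 'Y': 5}
```

With a single processor, X and W are both ready on cycle 0. X is started first because its DST is larger, so Y can start on cycle 5. Starting W first would have delayed Y to cycle 6.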

Using this scheduling technique in conjunction with 
the above-described parallel architecture configuration, 
the system's performance is optimized. 

Interleaving of Expanded Macros and Loop Unrolling 

An additional feature of the present invention is 
the use of interleaving of macros. Because the exemplary 
embodiment of the present invention has a single program 
counter, it is difficult to handle branching. This is so 
because the program counter controls all of the ACM's in 
the system; if it were to encounter a branch, it would 
cause all of the ACM's, as well as the I/O interfaces, to 
branch. And, if all of the ACM's and I/O interfaces in 
the system were to branch at the same time, many of them 
would be idle during the execution of the body of the 
branch. This is the type of inefficiency this aspect of 
the present invention is designed to overcome. 

To reduce this inefficiency, the present invention 
employs macros. The present invention uses macros to 
perform functions such as sine, cosine, etc. Instead of 
the entire system being idle during a subroutine call, 
the present invention expands the function macro into the 
necessary basic operations. Once that is complete, the 
compiler interleaves these basic operations with those 
operations needing to be run for the rest of the program; 
thus, the ACM's will not be unnecessarily idle. It should 
be noted that the purpose of macro expansion is to allow 
for computational interleaving and not just to avoid 
linkage overhead. 

Another type of expansion used by the present 
invention occurs during searches. As is well known in 
the art, one of the most efficient searching techniques 
for sorted sequential data is the binary search. A binary 
search involves comparing a value, first, against the 
middle value of the sequence of data to be searched. 
Based on the comparison, either the first or second half 
of the sequence can be eliminated. The same approach is 
repeated for the remaining half of the data and so on 
until the value is found or all data has been exhausted. 
The number of comparisons needed is approximately the 
logarithm base 2 of the total number of data values. 

Knowing this in advance, the present invention simply 
copies the body of the search code that many times (log 
base 2 of the number of data values), thus effectively 
"unrolling the loop". The unrolling of the loop creates 
a sequential program with no branches; in this form, the 
program is easily run on a system incorporating the 
present invention. 
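
An unrolled, branch-free search of this kind can be sketched for a table of eight sorted values, where the body is copied log2(8) = 3 times. This is an illustrative sketch only: the function `select` merely stands in for a compare-and-select step, the search variant shown (index of the last entry not greater than the value, assuming the value is not below the first entry) is an assumption, and none of the names come from the appendix.

```python
def select(cond, if_true, if_false):
    """Choose between two values from a 0/1 logical variable without branching."""
    return cond * if_true + (1 - cond) * if_false

def search8(table, value):
    """Index of the last entry <= value in a sorted 8-entry table.

    The loop body appears three times instead of inside a loop, so the
    instruction sequence is the same for every input.
    """
    lo = 0
    lo = select(int(table[lo + 4] <= value), lo + 4, lo)   # copy 1: halves of 4
    lo = select(int(table[lo + 2] <= value), lo + 2, lo)   # copy 2: halves of 2
    lo = select(int(table[lo + 1] <= value), lo + 1, lo)   # copy 3: halves of 1
    return lo

table = [0, 10, 20, 30, 40, 50, 60, 70]
print(search8(table, 35))  # 3
print(search8(table, 70))  # 7
```

Every call executes the same three compare-and-select steps regardless of the data, which is the property that allows such code to be scheduled alongside the rest of the program under a single program counter.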

Parts List and Applications 

All the circuits described above have been 
implemented using off-the-shelf parts. The simpler 
circuits, such as registers and muxes, abound in the 
market. In Figs. 5, 6A and 6B, the registers, 
multiplexers, ALU and memories are conventional elements. 
For example, the ALU 86 may be a TI8847. The workstation 
and host computer may be an IBM PC/AT. And the compiler 
which performs the scheduling function is written in 'C'. 

Some specific applications which impacted the design 
of the present invention and for which, as expected, the 
present invention performs exceptionally well include: 1) 
a six-degree-of-freedom (DOF) missile simulation, 2) a 
space shuttle main rocket engine simulation, 3) a 
four-DOF robotic manipulator arm simulation, and 4) a 
small, but very stiff, chemical kinetics application. 

Although the invention is illustrated and described 
herein embodied as a multiprocessor computer system 
designed for improved performance in the area of parallel 
scalar operations, the invention is nevertheless not 
intended to be limited to the details shown. Rather, 
various modifications may be made in the details within 
the scope and range of equivalents of the claims and 
without departing from the spirit of the invention. 




What is Claimed: 



1. A multiprocessor computer system having bus means 
for parallel processing of scalar operations, said system 
comprising: 
a) single program counter means for sequencing through 
the addresses of a program, 
b) a plurality of arithmetic computational modules 
(ACM's) each coupled to the single program counter means, 
each ACM comprising: 
i) an instruction memory responsive to the single 
program counter means for providing a scalar operation 
instruction, 
ii) means for providing, for each input to each ACM, 
an independent operand on each clock cycle, 
iii) calculation means for performing, in accordance 
with the scalar operation instruction, a scalar operation 
on the operands and producing a data output, 
iv) selecting means for receiving operands and a 
logical variable to select between operands as a function 
of the logical variable to provide a data-formatted 
output, and 
v) switch network means for receiving the 
data-formatted output from the selecting means and 
routing it to the bus means. 

2. The multiprocessor computer system of claim 1 in 
which the calculation means includes an arithmetic logic 
unit for comparing the operands and producing a data 
output instead of, or in addition to, a condition code, 
and 
the selecting means including (1) means for producing 
at least two data-formatted variables and (2) multiplexer 
means for selecting between the data-formatted variables 
as a function of the logical variable to provide the 
data-formatted output. 

3. The multiprocessor computer system of claim 1 in 
which the calculation means includes an arithmetic logic 
unit coupled to first and second operand storage means 
for comparing the operands in the first operand storage 
means and for producing the logical variable, and 
the selecting means including multiplexer means for 
selecting between the operands in the second operand 
storage means as a function of the logical variable to 
provide the data-formatted output. 

4. The multiprocessor computer system of claim 1 in 
which each means for providing operands includes first 
and second operand memory means coupled to the 
instruction memory, said instruction memory including 
means for addressing the first and second operand memory 
means, wherein one operand memory means receives an 
operand and the other operand memory means provides an 
operand, respectively, on each clock cycle. 

5. The multiprocessor computer system of claim 4 in 
which each of said first and second operand memory means 
includes two pairs of operand memories in which each pair 
performs simultaneous READ and WRITE operations, 
respectively, whereby on each clock cycle, a total of two 
operands are received and two operands are provided. 

6. The multiprocessor computer system of claim 1 
wherein said switch network means routes the 
data-formatted output as an operand to the means for 
providing operands for subsequent scalar operations. 

7. The multiprocessor computer system of claim 1 
which further comprises at least one interface processor 
means coupled to the single program counter means for 
transferring operands from an external source to the bus 
means. 